Question 1:

The Bay Asthma Map shows high concentrations in and around the inner bay ring, from the Peninsula (Palo Alto, San Mateo, and San Francisco) to the northern and western edges of the East Bay. The Asthma data is the spatially modeled, age-adjusted rate of emergency department (ED) visits per 10,000 individuals over 2015-2017.

The Bay PM2.5 Map shows elevated values across a much larger share of the Bay Area than the Asthma map does. That said, PM2.5 concentrations follow a similar pattern to the Asthma data: they also appear highest along the inner coast of the Peninsula, across the Palo Alto, San Mateo, and San Francisco area, and become even more pronounced in the East Bay around Alameda, particularly along its western coastline. There is also an interestingly high pocket of measured PM2.5 in Napa County. According to CalEnviroScreen, the data is measured as the annual mean concentration of Bay Area PM2.5, that is, the ‘weighted average of measured monitor concentrations and satellite observations’ in µg/m3, from 2015-2017.

Question 2:

The Asthma-PM2.5 data, when plotted with a best-fit line, takes on a boat-like shape: most of the points form the hull, while a separate line of points sits above it like a sail and mast. The regression line does not appear to fit the data well and seems to be pulled upward, toward a steeper positive slope, by this cluster of “sail” points.

Question 3: In layman’s terms, the relationship between PM2.5 levels and Asthma levels in the Bay Area is statistically significant, with a p-value that is effectively 0 (< 2e-16). The residual min and max are not symmetric (-48.453 and 178.672, respectively), which calls into question the normality of the residuals. Per the R-squared result, variation in PM2.5 levels explains about 9.6% of the variation in Bay Area Asthma levels. Finally, an increase of 1 µg/m3 in PM2.5 is associated with an increase of about 15.33 in the Asthma rate (ED visits per 10,000).
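As a rough check on the slope interpretation, the fitted line can be evaluated by hand. The sketch below hard-codes the intercept and slope from the regression summary in this report; the function name `predict_asthma` and the PM2.5 inputs of 8 and 9 are illustrative choices, not values taken from the data.

```r
# Coefficients copied from the lm(Asthma ~ PM2.5) summary in this report
intercept <- -80.677
slope     <- 15.335

# Hypothetical helper: predicted Asthma rate (ED visits per 10,000)
# for a given PM2.5 concentration in µg/m3
predict_asthma <- function(pm25) intercept + slope * pm25

# Illustrative inputs only: two PM2.5 levels that are 1 µg/m3 apart
low  <- predict_asthma(8)
high <- predict_asthma(9)

# The gap between the two predictions is the slope itself
gap <- high - low
```

Whatever pair of inputs is chosen, predictions 1 µg/m3 apart always differ by the slope, which is all the "15.33 point increase" statement says.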

Question 4: The residuals density plot has a pronounced right tail (right skew), which calls into question the normality of the residuals. Further, the plot is bimodal and its peaks are not centered on zero, which suggests the data should be transformed to reduce the skew. I find the bimodal pattern interesting, as we are not using any binary variables here.
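One point worth keeping in mind when judging whether residuals are "centered": for ordinary least squares with an intercept, the residuals average to zero by construction, so skew shows up in the median, the peak, and the tails rather than in the mean. The sketch below illustrates this on synthetic data (an assumption made purely for illustration; it is not the CalEnviroScreen data, and the simulated coefficients are arbitrary).

```r
set.seed(1)

# Synthetic stand-in data, NOT the CalEnviroScreen data: a log-normal outcome
# produces the kind of long right tail described above.
sim <- data.frame(PM2.5 = runif(500, min = 7, max = 10))
sim$Asthma <- exp(1 + 0.3 * sim$PM2.5 + rnorm(500, sd = 0.5))

fit <- lm(Asthma ~ PM2.5, data = sim)
res <- residuals(fit)

res_mean   <- mean(res)     # ~0 by construction for OLS with an intercept
res_median <- median(res)   # falls below the mean when the right tail is long
res_skew   <- mean((res - res_mean)^3) / sd(res)^3   # > 0 means right skew

d <- density(res)           # kernel density estimate; draw it with plot(d)
```

Calling `plot(d)` draws the density curve; comparing `res_median` with `res_mean` gives a quick numeric summary of the same asymmetry.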

After repeating steps 2-3 with the log() transformation:

Question 2 (repeated): The Asthma-PM2.5 data, when plotted with a best-fit line, appears far more regular in shape. The upward-trending “sail” I described earlier is far tamer in this distribution. To my eye, the data still shows only the slightest positive trend, which calls into question the strength of the correlation we are observing.

Question 3 (repeated):

Again, the relationship between PM2.5 levels and Asthma levels in the Bay Area is statistically significant, with a p-value of essentially 0. The residual min and max are smaller in magnitude but still not symmetric (-1.99922 and 0.40092, respectively), which still calls the normality of the residuals slightly into question. Per the R-squared result, variation in PM2.5 levels explains 8.5% of the variation in the log-transformed Asthma values. Finally, per the rerun regression, an increase of 1 in PM2.5 is associated with an increase of 0.27 in the log-transformed Asthma value.
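A slope on the log scale is easier to read once it is converted back to a percentage change. The sketch below assumes the rerun model has the form log(Asthma) ~ PM2.5, with only the outcome logged; that form is an assumption, since the rerun model's formula is not shown in this report. Under that assumption, a slope of b means each one-unit increase in PM2.5 multiplies the predicted Asthma rate by exp(b).

```r
# Slope reported for the log model in this report
b <- 0.27

# ASSUMPTION: the model is log(Asthma) ~ PM2.5 (outcome logged, predictor not).
# Then a one-unit increase in PM2.5 multiplies predicted Asthma by exp(b).
multiplier <- exp(b)
pct_change <- (multiplier - 1) * 100   # percent change per 1 µg/m3 of PM2.5
```

Under this reading, the 0.27 slope corresponds to roughly a 31% higher Asthma rate per additional µg/m3. If PM2.5 were logged as well, the slope would instead be read as an elasticity.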

Question 5: The new residuals density is centered around zero and much closer to normal; however, it is still bimodal.

I determined that the census tract with the most negative residual is actually a tie between two Stanford tracts (6085513000 and 6085511608) in Santa Clara County.

A negative residual means that the model’s prediction was higher than the actual observation; that is, the model overestimated the Asthma ED visit rate in these census tracts based on the aggregated Bay Area data. In Stanford’s case, the overestimate may arise because many residents do not live in the Bay Area full time and so do not reflect the asthma/PM2.5 association as accurately as full-time residents would. In other words, Stanford has a diverse, young, and geographically transient population, which may explain why the model predicts a higher Asthma value than what is observed.
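Finding the tract with the most negative residual takes a little care, because the regression output notes that one observation was dropped for missingness, so the residual vector can be shorter than the data frame. One way to keep rows aligned is `na.action = na.exclude`, sketched below on a toy data frame; the tract labels and values are made up for illustration.

```r
# Toy data with hypothetical tract IDs; one row has a missing outcome
toy <- data.frame(
  tract  = c("A", "B", "C", "D", "E"),
  PM2.5  = c(8.0, 8.5, 9.0, 9.5, 10.0),
  Asthma = c(40, NA, 20, 70, 75)
)

# na.exclude pads the residuals with NA so they line up with the rows of `toy`
fit <- lm(Asthma ~ PM2.5, data = toy, na.action = na.exclude)
toy$resid <- residuals(fit)

# Keep every row that shares the most negative residual, so ties are not lost
worst <- toy[which(toy$resid == min(toy$resid, na.rm = TRUE)), ]
```

Using `which(... == min(...))` rather than `which.min()` returns all tied rows, which matters when two tracts share the minimum, as reported above.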

## 
## Call:
## lm(formula = Asthma ~ PM2.5, data = fin_bay_ces4_map)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.453 -22.802  -6.947  11.971 178.672 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -80.677      9.997   -8.07 1.38e-15 ***
## PM2.5         15.335      1.186   12.93  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.94 on 1581 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.09557,    Adjusted R-squared:  0.09499 
## F-statistic: 167.1 on 1 and 1581 DF,  p-value: < 2.2e-16
## [1] -1.95044e-12